Note. We have a working script prepared for you to put your answers in, called “Working Script Practical 2.Rmd.
This is the second practical, we will dive into more advanced matters: model selection, subject specific parameters, model diagnostics, and model fit. Note that we cover a lot of ground here, most likely too much to finish within this practical. All parts are modular, so you are invited to work on what interests you most. We continue with the same dataset: the open access dataset from Rowland and Wenzel (2020), using the subset of the affective experiences happy, relaxed, anxious, and depressed, measured six times a day for 40 days in total.
The answers to the practicals are available in the
*_answers.html files.
In all practicals in this workshop, we make use of the
mHMMbayes package, version 1.1.0. You can load the package
like
library(mHMMbayes)
In this practical, we will also use the following packages:
# install.packages("coda")
library(ggplot2) # for plotting
library(ggmHMM) # plotting mHMM objects
library(coda) # for assessing convergence
We will also, again, make use of our helper functions
source('./helpers.R')
In this section, we obtain and inspect the subject specific parameters.
We first need to obtain the subject specific emission distribution
parameters, and save the subject specific parameters in an object. When
focusing on the emission distribution parameters, the function
obtain_emiss() can be used to obtain the subject specific
emission distribution parameters, by setting the input argument
level to "subject". Note that the default of
level is "group", so when left unspecified,
the function obtain_emiss() will return the group level
emission distribution parameters.
Exercise 1 Obtain the subject specific emission
distribution parameters using the function obtain_emiss()
and set the input argument level to "subject".
Save the output in the object emiss_subject. Inspect the
output.
# obtaining and saving the subject specific transition probabilities
emiss_subject <- obtain_emiss(out_2st_emotion, level = "subject")
Again, to develop an intuition on how much the emission distribution
parameters (e.g., the variable and state specific means) vary over
subjects, it can be helpful to plot the subject specific emission
distribution parameters, for example by adding the subject specific
parameters as dots to a group level barplot of the emission means. This
can be done using the ggmHMM function
plot_emiss().
Exercise 2 Plot (a subset of) the subject specific
emission distribution means, using the function
plot_emiss() from the ggmHMM package and
investigate how much the means differ over subjects. Hint: both
functions have the argument subject, which can be used to
specify a vector of subjects to plot random effects for.
plot_emiss(out_2st_emotion, subject_effects = TRUE, line = FALSE, subject = 1:50) # using ggmHMM function
In addition, it is important to check that the states represent the same construct over subjects. Working with our current dataset, that would mean that for each subject, state 1 represents a ‘good mood’ state, and state 2 represents a ‘bad mood’ state. For example, for the indicator happy, that would translate to the mean of state 1 being higher compared to state 2. Visually, this can be checked by connecting the subject specific means over the states within one indicator by lines.
Exercise 3 Plot (a subset of) the subject specific
emission distribution means connecting the subject specific means, and
visually investigate whether the state 1 is a relatively good state and
state 2 a relatively bad state over subjects. Hint: this can be done by
specifying line = TRUE in plot_emiss().
## plot first 50 subjects to aid interpretability
plot_emiss(out_2st_emotion, subject_effects = TRUE, line = TRUE, subject = 1:50) # using ggmHMM function
First, we need to obtain the subject specific parameters for the
emissions and the transition probabilities. When focusing on the
transition probabilities gamma, the function obtain_gamma()
can also be used to obtain the subject specific transition
probabilities, by setting the input argument level to
"subject". Note that the default of level is
"group", so when left unspecified, the function
obtain_gamma() will return the group level transition
probabilities.
Exercise 4 Obtain the subject specific transition
probabilities using the function obtain_gamma() and set the
input argument level to "subject". Save the
output in the object gamma_subject. Inspect the output.
# obtaining and saving the subject specific transition probabilities
gamma_subject <- obtain_gamma(out_2st_emotion, level = "subject")
# inspecting the output, restring the returned output to the first 10 subjects
gamma_subject[1:10]
## $`Subject 1`
## To state 1 To state 2
## From state 1 0.598 0.402
## From state 2 0.368 0.632
##
## $`Subject 2`
## To state 1 To state 2
## From state 1 0.737 0.263
## From state 2 0.147 0.853
##
## $`Subject 3`
## To state 1 To state 2
## From state 1 0.96 0.04
## From state 2 0.40 0.60
##
## $`Subject 4`
## To state 1 To state 2
## From state 1 0.716 0.284
## From state 2 0.282 0.718
##
## $`Subject 5`
## To state 1 To state 2
## From state 1 0.922 0.078
## From state 2 0.493 0.507
##
## $`Subject 6`
## To state 1 To state 2
## From state 1 0.767 0.233
## From state 2 0.361 0.639
##
## $`Subject 7`
## To state 1 To state 2
## From state 1 0.438 0.562
## From state 2 0.141 0.859
##
## $`Subject 8`
## To state 1 To state 2
## From state 1 0.948 0.052
## From state 2 0.232 0.768
##
## $`Subject 9`
## To state 1 To state 2
## From state 1 0.615 0.385
## From state 2 0.215 0.785
##
## $`Subject 10`
## To state 1 To state 2
## From state 1 0.839 0.161
## From state 2 0.479 0.521
Note that it can be quite insightful to compare the subject specific transition probabilities to the plotted state sequences (see practical 1). Subjects with large self-transitions (i.e., probabilities that denote a transition to the same state as the current state, provided on the diagonal of the transition probability matrix) will probably remain for a long time within a state, while low self-transition probabilities most likely will translate to state sequences where a state is only visited for a short consecutive sequence.
To develop an intuition on how much the transition probabilities vary over subjects, it can be helpful to plot the subject specific state transition probabilities, for example in a heat map with one row per subject, or in stacked dots with a dot for each subject specific transition probability, and columns for each of the possible state transitions.
Exercise 5 Plot (a subset of) the subject specific transition probabilities, and investigate how much the transition probabilities differ over subjects. This can be done in 2 ways:
PlotTrans(), a barplot may be
plotted showing the group level effects, with points for the
subject-specific effects (or a subset thereof). It uses the following
arguments:
mHMMplot_gamma() from the ggmHMM
package, a heatmap may be plotted for each subject (or a subset)
separately. This can be done by
subject_effects to
TRUE, andsubject to a vector of
subject IDs (e.g., 1:10), or a single number do specify the
subject.ncol_facet to setup the layout# Transition probabilities of first subject using helper function
PlotTrans(out_2st_emotion, subject = 1:50) # plot
# Transition probabilities of first 10 subjects using ggmHMM function
plot_gamma(out_2st_emotion, level = "subject", subject = 1:10, ncol_facet = 5) # plot with facets
Exercise 6 Interpret the transition probabilities. Is there much heterogeneity between persons? Are subjects generally more likely to stay in the same state, or switch states? What does this mean, if we take the substantive interpretation of the states into account?
Note that for a two-state model, the information given by the transition probabilities is limited; The diagonals inform us about the probability of remaining in a state (and therefore the stability/duration of states), but the off-diagonals do not provide us with unique information; if a subject goes out of the state, they only have one left to go to. A three-state model may have more interesting information to provide. It could tell us, for example, whether a subject is more likely to immediately switch between a ‘good’ and a ‘bad’ state, or whether they generally move through a ‘medium’ state first.
One of the challenges when using hidden Markov models is the number
of states to choose. Here, we will limit ourselves to models with 2, 3,
and 4 states. As checking model fit is discussed at a later point in the
practical, we will investigate the different models based on the state
composition and the AIC. However, it should be noted that convergence
and model fit are important characteristics that should be considered
when choosing a model to interpret. The 2-state model was fitted in the
first practical, and saved in the object
out_2st_emotion.
Exercise 7 To save time we have already pre-run a
3-state and 4-state model. Use the function readRDS() to
load the fitted models from the files out_3st_emotion.rds
and out_4st_emotion.rds, and save them in the objects
out_3st_emotion and out_4st_emotion. The
settings we used are shown below.
out_3st_emotion <- readRDS("./data/out_3st_emotion.rds")
out_4st_emotion <- readRDS("./data/out_4st_emotion.rds")
####################
# 3 state model
####################
## general model properties
m <- 3
n_dep <- 4
## starting values for gamma
start_gamma_3st <- matrix(c(0.8, 0.1, 0.1,
0.1, 0.8, 0.1,
0.1, 0.1, 0.8), byrow = TRUE, ncol = m, nrow = m)
## starting values for the emission distribution
start_emiss_3st <- list(matrix(c(80, 10, # happy
50, 10,
20, 10), byrow = TRUE, ncol = 2, nrow = m),
matrix(c(80, 10, # relaxed
50, 10,
20, 10), byrow = TRUE, ncol = 2, nrow = m),
matrix(c(10, 5, # anxious
30, 10,
50, 15), byrow = TRUE, ncol = 2, nrow = m),
matrix(c(10, 5, # depressed
30, 10,
50, 15), byrow = TRUE, ncol = 2, nrow = m))
## specifying weakly informative prior for continuous emission distributions
emotion_prior_emiss_3st <- prior_emiss_cont(
gen = list(m = m, n_dep = n_dep),
emiss_mu0 = list(matrix(c(80, 50, 20), nrow = 1), # happy
matrix(c(80, 50, 20), nrow = 1), # relaxed
matrix(c(10, 30, 50), nrow = 1), # anxious
matrix(c(10, 30, 50), nrow = 1)), # depressed
emiss_K0 = rep(list(1), n_dep),
emiss_V = rep(list(rep(5^2, m)), n_dep),
emiss_nu = rep(list(1), n_dep),
emiss_a0 = rep(list(rep(1.5, m)), n_dep),
emiss_b0 = rep(list(rep(20, m)), n_dep),
)
####################
# 4 state model
####################
## general model properties
m <- 4
## starting values for gamma
start_gamma_4st <- matrix(c(0.7, 0.1, 0.1, 0.1,
0.1, 0.7, 0.1, 0.1,
0.1, 0.1, 0.7, 0.1,
0.1, 0.1, 0.1, 0.7), byrow = TRUE, ncol = m, nrow = m)
## starting values for the emission distribution
start_emiss_4st <- list(matrix(c(80, 10, # happy
60, 10,
40, 10,
20, 10), byrow = TRUE, ncol = 2, nrow = m),
matrix(c(80, 10, # relaxed
60, 10,
40, 10,
20, 10), byrow = TRUE, ncol = 2, nrow = m),
matrix(c(10, 5, # anxious
20, 10,
40, 10,
60, 10), byrow = TRUE, ncol = 2, nrow = m),
matrix(c(10, 5, # depressed
20, 10,
40, 10,
60, 10), byrow = TRUE, ncol = 2, nrow = m))
## specifying weakly informative prior for continuous emission distributions
emotion_prior_emiss_4st <- prior_emiss_cont(
gen = list(m = m, n_dep = n_dep),
emiss_mu0 = list(matrix(c(80, 60, 40, 20), nrow = 1), # happy
matrix(c(80, 60, 40, 20), nrow = 1), # relaxed
matrix(c(10, 20, 40, 60), nrow = 1), # anxious
matrix(c(10, 20, 40, 60), nrow = 1)), # depressed
emiss_K0 = rep(list(1), n_dep),
emiss_V = rep(list(rep(5^2, m)), n_dep),
emiss_nu = rep(list(1), n_dep),
emiss_a0 = rep(list(rep(1.5, m)), n_dep),
emiss_b0 = rep(list(rep(20, m)), n_dep),
)
# Fitting the 3 state model
m <- 3
set.seed(42)
out_3st_emotion <- mHMM(s_data = emotion_mHMM,
data_distr = "continuous",
gen = list(m = m, n_dep = n_dep),
start_val = c(list(start_gamma_3st), start_emiss_3st),
emiss_hyp_prior = emotion_prior_emiss_3st,
mcmc = list(J = 500, burn_in = 200))
# Fitting the 4 state model
m <- 4
set.seed(42)
out_4st_emotion <- mHMM(s_data = emotion_mHMM,
data_distr = "continuous",
gen = list(m = m, n_dep = n_dep),
start_val = c(list(start_gamma_4st), start_emiss_4st),
emiss_hyp_prior = emotion_prior_emiss_4st,
mcmc = list(J = 500, burn_in = 200))
Exercise 8 Inspect the general output of the 3- and
4- state models by using the inbuilt print() and
summary() for objects of class mHMM returned
by the function mHMM().
# 3 state model
out_3st_emotion
## Number of subjects: 125
##
## 500 iterations used in the MCMC algorithm with a burn in of 200
## Average Log likelihood over all subjects: -2823.466
## Average AIC over all subjects: 5706.932
## Average AICc over all subjects: 5710.502
##
## Number of states used: 3
##
## Number of dependent variables used: 4
##
## Type of dependent variable(s): continuous
summary(out_3st_emotion)
## State transition probability matrix
## (at the group level):
##
## To state 1 To state 2 To state 3
## From state 1 0.864 0.087 0.050
## From state 2 0.314 0.575 0.111
## From state 3 0.193 0.110 0.697
##
##
## Emission distribution ( continuous ) for each of the dependent variables
## (at the group level):
##
## $happy
## Mean SD
## State 1 68.106 14.927
## State 2 50.765 22.294
## State 3 40.561 14.148
##
## $relaxed
## Mean SD
## State 1 62.515 20.896
## State 2 45.273 25.916
## State 3 38.207 16.530
##
## $anxious
## Mean SD
## State 1 5.357 3.538
## State 2 27.273 26.663
## State 3 23.466 14.509
##
## $depressed
## Mean SD
## State 1 9.798 7.566
## State 2 36.392 26.328
## State 3 43.209 18.970
# 4 state model
out_4st_emotion
## Number of subjects: 125
##
## 500 iterations used in the MCMC algorithm with a burn in of 200
## Average Log likelihood over all subjects: -2772.047
## Average AIC over all subjects: 5632.095
## Average AICc over all subjects: 5639.906
##
## Number of states used: 4
##
## Number of dependent variables used: 4
##
## Type of dependent variable(s): continuous
summary(out_4st_emotion)
## State transition probability matrix
## (at the group level):
##
## To state 1 To state 2 To state 3 To state 4
## From state 1 0.785 0.145 0.053 0.017
## From state 2 0.234 0.664 0.063 0.038
## From state 3 0.265 0.222 0.449 0.064
## From state 4 0.040 0.179 0.063 0.718
##
##
## Emission distribution ( continuous ) for each of the dependent variables
## (at the group level):
##
## $happy
## Mean SD
## State 1 69.306 14.580
## State 2 55.936 14.678
## State 3 52.343 21.409
## State 4 33.437 13.849
##
## $relaxed
## Mean SD
## State 1 63.676 20.775
## State 2 51.315 19.587
## State 3 45.295 25.826
## State 4 32.520 15.771
##
## $anxious
## Mean SD
## State 1 2.803 2.512
## State 2 10.563 8.595
## State 3 36.242 28.347
## State 4 27.706 20.357
##
## $depressed
## Mean SD
## State 1 6.349 5.129
## State 2 27.410 17.391
## State 3 35.180 26.757
## State 4 52.370 18.392
First, we will investigate the composition of the states, and use this to evaluate whether it seems sensible to add extra states. Especially with multivariate data, inspecting the composition of the states is done most easily visually.
Exercise 9 Use the function
plot_emiss() from the ggmHMM package to visualize
the emission distributions of the three-state and four-state model. Do
not plot subject-specific effects. Does it seem to make sense to add
extra state(s)?
plot_emiss(out_3st_emotion, subject_effects = FALSE) + ylim(c(0,100)) # three-state model with helper function
plot_emiss(out_4st_emotion, subject_effects = FALSE) + ylim(c(0,100)) # four-state model with ggmHMM function
# Note: the 4th state differentiates the emotions even more.
Next, let’s compare the 2-, 3-, and 4-state model on the model selection criterion AIC.
Exercise 10 Obtain the AIC values of the 2-, 3-, and
4-state model using the print() function on the mHMM
objects out_2st_emotion, out_3st_emotion,
out_4st_emotion, and compare. It can be helpful to
visualize the AIC values over the states to get a feeling for how much
the AIC decreases over the models.
out_2st_emotion
## Number of subjects: 125
##
## 500 iterations used in the MCMC algorithm with a burn in of 200
## Average Log likelihood over all subjects: -2870.816
## Average AIC over all subjects: 5777.631
## Average AICc over all subjects: 5778.915
##
## Number of states used: 2
##
## Number of dependent variables used: 4
##
## Type of dependent variable(s): continuous
out_3st_emotion
## Number of subjects: 125
##
## 500 iterations used in the MCMC algorithm with a burn in of 200
## Average Log likelihood over all subjects: -2823.466
## Average AIC over all subjects: 5706.932
## Average AICc over all subjects: 5710.502
##
## Number of states used: 3
##
## Number of dependent variables used: 4
##
## Type of dependent variable(s): continuous
out_4st_emotion
## Number of subjects: 125
##
## 500 iterations used in the MCMC algorithm with a burn in of 200
## Average Log likelihood over all subjects: -2772.047
## Average AIC over all subjects: 5632.095
## Average AICc over all subjects: 5639.906
##
## Number of states used: 4
##
## Number of dependent variables used: 4
##
## Type of dependent variable(s): continuous
AICs <- data.frame(AIC = c(5777.631, 5706.932, 5632.095),
states = c(2:4))
ggplot(AICs, mapping = aes(x = states, y = AIC)) +
geom_point() +
geom_line() +
ggtitle("AIC over the 2 - 4 state models") +
xlab("Number of states") +
theme_minimal()
As you can see, determining the number of states can be a subjective and tricky endeavor, and depends very much on the data and research question at hand.
In this section we will inspect model diagnostics and fit, checking convergence, label switching, using posterior predictive checks (PPCs), and evaluating Pseudo residuals.
Within any Bayesian analysis, convergence needs to be assessed in order to rule out output resulting from a local (instead of global) maximum. To check this, multiple chains (now we will use 2, but we recommend using at least 4) need to be run, performing the same analysis but using (slightly) different starting values in each chain. Note that in multilevel hidden Markov models, it is important to still keep sensible starting values when varying these. For example, when modelling continuous or count data, we need to keep the same sensible ordering of \(\bar{\mu}_i\) or \(\bar{\lambda}_i\) given data and process. Values of the (weakly informative) prior distributions should be fixed over the chains.
Exercise 11 Specify new starting values to fit at least one additional chain for the 2 state multilevel HMM model. Do not yet run the model.
Exercise 12 To save time, we will provide you with
an additional chain, run with the starting values below. Use the
function readRDS() to load the fitted model from the file
out_2st_emotion_b.rds, and save it in the object
out_2st_emotion_b.
Note. The priors that we used may be slightly different from the ones you specified in your model in the first practical. For now, we will ignore this for simplicity, but it is important to use the same settings for the prior when applying the model yourself.
out_2st_emotion_b <- readRDS("./data/out_2st_emotion_b.rds")
m <- 2
n_dep <- 4
start_gamma_b <- matrix(c(0.9, 0.1,
0.1, 0.9), byrow = TRUE, ncol = m, nrow = m)
start_emiss_b <- list(matrix(c(80, 10, # happy
40, 15), byrow = TRUE, ncol = 2, nrow = m),
matrix(c(80, 10, # relaxed
40, 15), byrow = TRUE, ncol = 2, nrow = m),
matrix(c(5, 5, # anxious
30, 15), byrow = TRUE, ncol = 2, nrow = m),
matrix(c(5, 5, # depressed
30, 15), byrow = TRUE, ncol = 2, nrow = m))
out_2st_emotion_b <- mHMM(s_data = emotion_mHMM,
data_distr = "continuous",
gen = list(m = m, n_dep = n_dep),
start_val = c(list(start_gamma_b), start_emiss_b),
emiss_hyp_prior = emotion_prior_emiss,
mcmc = list(J = 500, burn_in = 200))
Evidence for non convergence can be checked in multiple ways. A first way is, using trace plots which visualize the parameter estimates over the iterations of the MCMC sampler. The chains should not show any trend and good mixing. Furthermore, we should have no label switching (or at least very little) in the subject-specific parameters (see the section ‘label switching’ later in the practical)..
The trace plots can be obtained using the ggmHMM function
plot_trace(). For now, the following arguments are
relevant:
mHMMExercise 13
A. Use plot_trace() to create a trace
plot for the group-level transition probabilities of the 2-state model.
Use both out_2st_emotion and
out_2st_emotion_b. Hint: you leed to put the models in a
list using list() to use them in
plot_trace().
plot_trace(model = list(out_2st_emotion, out_2st_emotion_b),
component = "gamma",
level = "group")
B. In addition to the group level parameters, it is
also important to check convergence at the subject level. Use the
function plot_trace() to create a trace plot for the
subject specific emission means of the first subject, for the variable
“happy”. Use both out_2st_emotion and
out_2st_emotion_b.
plot_trace(model = list(out_2st_emotion, out_2st_emotion_b),
component = "emiss",
level = "subject",
vrb = "happy",
subject = 1) + ylim(c(0,100))
We should also inspect convergence numerically, with the potential scale reduction factor \(\hat{R}\). \(\hat{R}\) evaluates equality of means of the different chains - the degree to which variance (of the means) between chains exceeds what one would expect if the chains were identically distributed. A value of R-hat below 1.2 is used to indicate convergence.
We have provided you with a helper function, f_GR(), to
calculate the group level \(\hat{R}\)
values. The function takes the following arguments:
It returns a list with 2 matrices:
Exercise 14 Use the function f_GR() to
calculate the \(\hat{R}\) values for
the group level transition probabilities and emission distribution
means. Did the model reach convergence?
f_GR(model_list = list(out_2st_emotion, out_2st_emotion_b),
digits = 2,
burnin = 200)
## $m_RG_gamma
## [,1]
## [1,] 1.01
## [2,] 1.06
##
## $RG_emiss
## [,1] [,2]
## [1,] 1.00 1.01
## [2,] 1.07 1.00
## [3,] 1.01 1.06
## [4,] 1.19 1.14
A concept related to convergence is that of label switching. In a multilevel HMM, it does not matter for the model what the ordering of the states is. For example, a model where State 1 is the ‘good’ state (e.g. with high “happy” and low “depressed”), and state 2 the ‘bad’ state, or whether it is the other way around. In the iterative MCMC process, the model may switch the labels of the states over iterations, and therefore the interpretation of the states. This is called label switching. In the mHMM, label switching happens at the subject level, but consequently also affects estimates of the group-level effects. It is important to check to what extent this happens.
To check for label switching, it is helpful to use trace plots for
the subject-level parameters, similar to those from last exercise. It
may, however, be useful here to show the trace plots from multiple
states in a single plot, to see whether the states switch labels. This
can be done with the helper function
plot_label_switching().
plot_label_switching() takes as arguments: * model:
output object from mHMM() * subject: a vector of integers
to select the subjects to create plots for. * vrb: a character vector to
specify the variable to plot.
Exercise 15 Use the function
plot_label_switching() to evaluate label switching for
subject 1 to 5, for the 3-state model. Is there a label switching issue
for these persons? And for subjects 6 to 10?
plot_label_switching(out_3st_emotion, subject = 1:5)
plot_label_switching(out_3st_emotion, subject = 6:10)
Exercise 16 The estimate of subject 9’s state specific means for each variable is calculated by taking the mean (or median) of all sampled values after the specified burnin period. How does the label switching issue affect the estimates of this subject’s state-specific means?
Finally, we will assess whether the model recovers the data correctly
with respect to the mean and standard deviation, to reveal potential
model misspecification. It is particularly helpful for group-level model
evaluation. First, we need to generate simulated data sets based on
obtained parameter estimates. The function sim_mHMM() can
be used to simulate data using a multilevel hidden Markov model. Note
that for each simulated dataset, new subject-specific parameters are
simulated, but the group-level parameters remain fixed.
The function takes the following arguments (for more detailed
information, use help(sim_mHMM()):
Note. In the data, there is some observed missingness, including the night gap. For now, we simulate data without the night gap and ignore the missingness. Although it will likely make little difference, it may be sensible to insert missing values in places where the data are missing when performing PPCs on your own data.
Exercise 17 Use the starter script below to simulate 100 datasets based on the model parameters of the 2-state multilevel HMM.
n_sim <- 100 # for now, we will use 100 simulated datasets, but feel free to change.
sim_datasets <- vector("list", n_sim)
m <- ... # specify the number of states
n_dep <- ... # specify the number of dependent variables used in mHMM()
# calculate variance in emission distribution
sapply(out_2st_emotion$emiss_varmu_bar, apply, 2, mean, na.rm = TRUE)
for(i in 1:n_sim){
sim_datasets[[i]] <- sim_mHMM(n_t = 240 , # specify the number of beeps per subject
n = 125, # specify the number of subjects
data_distr = "continuous",
gen = list(m = m, n_dep = n_dep),
gamma = ... , # specify the obtained transition prob matrix
emiss_distr = ... , # specify the emission distribution, see help file
var_gamma = 0.01,
var_emiss = c(300, 300, 200, 200)
)
}
n_sim <- 100
sim_datasets <- vector("list", n_sim)
m <- 2
n_dep <- 4
# calculate variance in emission distribution
sapply((out_2st_emotion$emiss_varmu_bar), apply, 2, mean, na.rm = TRUE)
## happy relaxed anxious depressed
## varmu_1 323.8858 349.0863 77.74446 174.3764
## varmu_2 324.4923 271.7051 441.86128 314.2728
set.seed(42)
for(i in 1:n_sim){
sim_datasets[[i]] <- sim_mHMM(n_t = 240,
n = 125,
data_distr = "continuous",
gen = list(m = m, n_dep = n_dep),
gamma = unname(obtain_gamma(out_2st_emotion)),
emiss_distr = obtain_emiss(out_2st_emotion),
var_gamma = 0.01,
var_emiss = c(300, 300, 200, 200)
)
}
Exercise 18 For each simulated data set, calculate the summary statistics of interest for all four variables (i.e., mean and standard deviation).
means_simdata <- matrix(, nrow = n_sim, ncol = n_dep)
for(i in 1:n_sim){
means_simdata[i,] <- apply(sim_datasets[[i]]$obs[,2:(n_dep + 1)], 2, mean)
}
sd_simdata <- matrix(, nrow = n_sim, ncol = n_dep)
for(i in 1:n_sim){
sd_simdata[i,] <- sqrt(apply(sim_datasets[[i]]$obs[,2:(n_dep + 1)], 2, var))
}
Now that we have simulated datasets and obtained the summary characteristics, we will visually compare the characteristics of the simulated data to that of the observed data.
Exercise 19 Plot the summary characteristics (mean and sd) of the simulated datasets in histograms, and add a vertical line for the mean and sd in the observed data to aid the comparison.
vars <- names(obtain_emiss(out_2st_emotion))
par(mfrow = c(2,4))
for(i in 1:n_dep){
hist(means_simdata[,i],
main = vars[i],
xlab = paste("Mean", vars[i])
)
abline(col = "red", lwd = 2, v = mean(emotion_mHMM[,1+i], na.rm = TRUE))
}
# plot histogram sd simulated data and observed in real data
for(i in 1:n_dep){
hist(sd_simdata[,i],
main = vars[i],
xlab = paste("SD", vars[i]),
xlim = c(20,30)
)
abline(col = "red", lwd = 2, v = sqrt(var(emotion_mHMM[,1+i], na.rm = TRUE)))
}
# The means show no evidence of bad model fit, however the standard deviations of the simulated datasets are higher compared to the observed data, especially for happy and relaxed. It would be interesting to see whether a 3-state model would partly solve this.
Where PPCs are most useful for group-level model evaluation,
pseudoresiduals are most useful for model evaluation at the subject
level. These are residuals that are obtained based on the most likely
state sequence for each person (hence the suffix pseudo),
using the Viterbi algorithm (see previous practical).
We can compute the Root-Mean-Squared-Error (RMSE) of the pseudoresiduals, which is a measure of how well the model fits the data, where a lower RMSE indicates a better fit. This is especially useful to compare models.
We have provided you with two helper functions for this.
GetResid() can be used to obtain residuals for a specific
person for a specific variable. The function takes the following
arguments:
It outputs a list with 4 elements:
Exercise 20
A. Use GetResid() to obtain
pseudoresiduals for subject 1 for all variables. Do this for both the
2-state and 3-state model. Hint: use the function lapply()
to loop over the variables.
pseudoresid_2st <- lapply(c("happy", "relaxed", "anxious", "depressed"), # for variable 1 to 4
function(j) GetResid(data = emotion_mHMM, model = out_2st_emotion, vrb = j, subject = 1))
pseudoresid_3st <- lapply(c("happy", "relaxed", "anxious", "depressed"), # for variable 1 to 4
function(j) GetResid(data = emotion_mHMM, model = out_3st_emotion, vrb = j, subject = 1))
B. Obtain the RMSE for the pseudoresiduals of subject 1 for all variables, and compare the RMSE of the 2-state and 3-state model. Which model fits better for this person?
RMSE_2st <- sapply(pseudoresid_2st, function(x) x$RMSE)
RMSE_3st <- sapply(pseudoresid_3st, function(x) x$RMSE)
print(RMSE_2st)
## [1] 20.04446 22.32644 24.64004 22.43806
print(RMSE_3st)
## [1] 14.58216 21.99121 24.38652 20.12457
## The 3-state model has slightly better fit for subject 1 as it has the lowest RMSE for all variables.
After obtaining the pseudoresiduals, we can also plot them separately for each person and each variable to assess: 1. Whether the pseudoresiduals are (roughly) normally. 2. Whether there is a trend over time. 2. Whether there is any remaining dependency between residuals over time (can also be assessed using the autocorrelation). 3. Whether there are any outliers. 4. Whether variance is (roughly) stable over time.
We have provided you with a helper function,
PlotResid(), to plot the pseudoresiduals. It takes the
following arguments:
mHMM()Exercise 21 Use PlotRes() to plot the
pseudoresiduals for subject 1 to 3 to for the variable
happy, for the 3-state model. How do you evaluate the model
fit for these subjects?
PlotRes(out_3st_emotion,
data = emotion_mHMM,
vrb = "happy",
subject = 1:3)